Fundamentals of Arthroscopic Surgery Training and beyond: a reinforcement learning exploration and benchmark

Purpose This work presents FASTRL, a benchmark set of instrument manipulation tasks adapted to the domain of reinforcement learning and used in simulated surgical training. This benchmark enables and supports the design and training of human-centric reinforcement learning agents which assist and evaluate human trainees in surgical practice.

Methods Simulation tasks from the Fundamentals of Arthroscopic Surgery Training (FAST) program are adapted to the reinforcement learning setting for the purpose of training virtual agents that are capable of providing assistance and scoring to the surgical trainees. A skill performance assessment protocol is presented based on the trained virtual agents.

Results The proposed benchmark suite presents an API for training reinforcement learning agents in the context of arthroscopic skill training. The evaluation scheme based on both heuristic and learned reward functions robustly recovers the ground truth ranking on a diverse test set of human trajectories.

Conclusion The presented benchmark enables the exploration of a novel reinforcement learning-based approach to skill performance assessment and in-procedure assistance for simulated surgical training scenarios. The evaluation protocol based on the learned reward model demonstrates potential for evaluating the performance of surgical trainees in simulation.

Supplementary Information The online version contains supplementary material available at 10.1007/s11548-024-03116-z.

The hardware platform that serves as the basis for the RL benchmark is the VirtaMed FAST simulator shown in fig. 9. The simulator consists of a hollow dome structure with instrument entry portals for a selection of arthroscopic surgical tools (scope, hook, grasper). The output of the virtual arthroscopic camera as well as an optional third-person view of the dome structure are rendered on screen. The evaluation dataset for the performance ranking experiments (section 4) was recorded using this hardware platform. Figure 10 depicts Laparos, the simulator platform adapted for laparoscopic surgery, which was used to record the expert and user demonstrations in section 4.

A.1 Computational resources
The experiments were executed on a GPU server equipped with Intel Xeon Gold 6150 and AMD EPYC 7742 CPUs and NVIDIA GTX 1080 Ti GPUs. The average running times per experiment are shown in table 4. For the experiments in section 4, we use an extension of the Unity ml-agents framework as the algorithmic implementation.

B API description
We provide an interface to algorithm implementations from two popular frameworks: an extension of the ml-agents [21] framework for the PPO with curiosity exploration algorithm and the inverse RL algorithms, as well as stable-baselines3 [22] for a broader selection of baseline forward RL algorithms. Furthermore, simulation parameters such as the heuristic reward weights, target positions and curricula are exposed to the user. For the full documentation, we refer the reader to the project website: fastrl.ethz.ch.
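As a purely illustrative sketch of how such an interface is typically driven from stable-baselines3 (assuming the benchmark tasks are exposed as Gym-style environments; the environment ID "FastRLImageCentering-v0" and the keyword arguments below are assumptions made for the example, not the documented API), a forward RL baseline could be trained as follows:

```python
import gymnasium as gym
from stable_baselines3 import PPO

# Assumption: the benchmark registers its tasks as Gym-style environments and
# exposes simulation parameters (heuristic reward weights, target positions,
# curricula) as constructor arguments.
env = gym.make(
    "FastRLImageCentering-v0",              # hypothetical environment ID
    reward_weights={"distance": 1.0,        # dense positional potential
                    "angle": 0.5,           # dense rotational potential
                    "task_completed": 10.0},
    curriculum="default",
)

model = PPO("MlpPolicy", env, verbose=1)    # SB3 default hyperparameters
model.learn(total_timesteps=1_000_000)
model.save("ppo_image_centering")
```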

C Dataset details
FAST Dataset. For the purpose of skill performance evaluation, a dataset was obtained from subjects performing the three benchmark tasks on the simulator platform. Five subjects with different levels of expertise were invited to perform each of the three benchmark tasks for a total of 5 repetitions each. The trajectories were recorded according to the state space specification described in section 3.2. Two subjects were already very familiar with all tasks and are considered Experts, while the others, with little or no experience with the simulation, are considered Novices, as shown in table 5. Additionally, the participants are graded based on the length of their performed trajectories; this metric coincides with the default heuristic used by the VirtaMed simulator platform. For the evaluation performed in section 4, the subjects used the physical hardware described in appendix A, with instrument movement tracked by a magnetic sensor. In the second evaluation, the keyboard-and-mouse interface was used to record the trajectories.

Laparoscopic diagnostic tour dataset. The diagnostic tour dataset used in section 4 was gathered using the laparoscopy simulator, where users were asked to perform a guided diagnostic tour of the abdomen with a fixed sequence of anatomical landmarks highlighted on screen. A total of 100 trajectories from a diverse set of experienced practitioners were obtained. The average reported proficiency based on the simulator evaluation (max. attainable score 150) was 128.24 ± 18.32 points.
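For illustration, a minimal sketch of the length-based grading mentioned above, assuming trajectories are stored as arrays of tool-tip positions (the function names are ours, not part of the dataset tooling):

```python
import numpy as np

def trajectory_length(positions):
    """Total Euclidean path length of an (N, 3) array of tool-tip positions."""
    positions = np.asarray(positions, dtype=float)
    return float(np.linalg.norm(np.diff(positions, axis=0), axis=1).sum())

def rank_by_length(trajectories):
    """Rank subjects from best (shortest path) to worst (longest path)."""
    lengths = {name: trajectory_length(traj) for name, traj in trajectories.items()}
    return sorted(lengths, key=lengths.get)
```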

D Additional results
This section provides additional results obtained using the algorithmic pipeline in fig. 1 and partially described in section 4.

D.1 Ablation study on reward components
In this set of experiments, we demonstrate the dependence of the forward RL methods on the reward shaping scheme. Due to the intricate task structure, which combines multiple optimisation objectives, we introduce a number of additional reward components to improve the sample efficiency of the algorithms solving the tasks. We perform an ablation study on the reward components, shown in fig. 11 for the three tasks, in order to demonstrate the necessity of reward shaping. The evaluation is carried out using the on-policy PPO algorithm. The five ablation settings comprise the baseline reward structure (baseline), the removal of the distance and angle potentials (noDist, noAngle, noDistNoAngle) and the sparse reward setting in which a reward is only given if the target is visualised correctly (onlyTaskCompleted). The plots report the median number of tasks completed over 5 randomly seeded repetitions per ablation setting. We observe that potential-based reward shaping on both the Cartesian and angular components is crucial to obtain a policy which solves all targets successfully. We observe similar behaviour on the Periscoping task. In the case of TraceLines, the gradation is less distinct due to the specifics of the task: the visual components of the reward are statistically more prevalent in TraceLines and provide a denser reward structure even in the absence of the distance and angle penalties. The sparse reward signal (onlyTaskCompleted) fails to solve a single target.
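A schematic view of how the five ablation settings map onto the shaped-reward components is sketched below; the weight names are illustrative assumptions and do not correspond one-to-one to the benchmark's exposed parameters:

```python
# Each setting zeroes out a subset of the shaped-reward components; the sparse
# setting keeps only the task-completion bonus (target visualised correctly).
ABLATION_SETTINGS = {
    #                        distance      angle       visual     completion
    "baseline":          dict(w_dist=1.0, w_angle=1.0, w_vis=1.0, w_done=1.0),
    "noDist":            dict(w_dist=0.0, w_angle=1.0, w_vis=1.0, w_done=1.0),
    "noAngle":           dict(w_dist=1.0, w_angle=0.0, w_vis=1.0, w_done=1.0),
    "noDistNoAngle":     dict(w_dist=0.0, w_angle=0.0, w_vis=1.0, w_done=1.0),
    "onlyTaskCompleted": dict(w_dist=0.0, w_angle=0.0, w_vis=0.0, w_done=1.0),
}

# Each setting is trained with PPO over 5 random seeds and the median number of
# completed tasks is reported, as in fig. 11.
```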

E Algorithmic details and SB3

E.1 stable-baselines3 Results
We have evaluated a number of standard RL algorithm implementations from stable-baselines3 on our benchmark. Due to high variance and the lack of tuned hyperparameters, we report the number of tasks completed by the best model from a set trained over 5 random seeds for every algorithm, using the standard set of hyperparameters provided by the stable-baselines3 framework (table 9).
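A hedged sketch of this evaluation protocol (default hyperparameters, 5 seeds per algorithm, best model kept) is given below; the environment ID and the count_completed_tasks helper are illustrative assumptions, not part of the benchmark code:

```python
import gymnasium as gym
from stable_baselines3 import PPO, SAC, TD3

def count_completed_tasks(model, env, n_episodes=10):
    """Placeholder evaluation: roll out the policy and count solved episodes."""
    completed = 0
    for _ in range(n_episodes):
        obs, _ = env.reset()
        done = False
        while not done:
            action, _ = model.predict(obs, deterministic=True)
            obs, reward, terminated, truncated, info = env.step(action)
            done = terminated or truncated
        completed += int(info.get("task_completed", False))
    return completed

best_per_algorithm = {}
for algo in (PPO, SAC, TD3):
    scores = []
    for seed in range(5):
        env = gym.make("FastRLImageCentering-v0")   # hypothetical environment ID
        model = algo("MlpPolicy", env, seed=seed)   # SB3 defaults, no tuning
        model.learn(total_timesteps=1_000_000)
        scores.append(count_completed_tasks(model, env))
    best_per_algorithm[algo.__name__] = max(scores)  # report the best model only
```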

E.2 Algorithm hyperparameters (ml-agents)
This section provides an overview of the hyperparameters used to train the algorithms described in section 4.

Table 4 :
Overview of the average algorithm runtimes (in minutes per 1M environment interactions) for the different benchmark environments.

Table 5 :
Participant performance ranking

Table 6 :
Trajectory scores (rewards and values normalised) for human and virtual agents on the ImageCentering task

Table 7 :
Trajectory scores (rewards and values normalised) for human and virtual agents on the TraceLines task

Table 8 :
Normalised scores (heuristic and learned rewards and values) for human trajectories Expert, Intermediate and Poor on ImageCentering task

Table 9 :
Number of tasks completed using stable-baselines3 algorithm implementations

E.3 Heuristic reward structure
The reward structure used for training in the forward modality uses a scalarisation approach based on a number of objectives. We present an overview of the individual objective terms in table 12. The objectives consist of dense reward potentials in the positional and rotational spaces (the distance and angle penalties in table 12) as well as a number of binary reward components, such as the various visualisation targets, which vary across tasks.
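As a minimal sketch of such a scalarised heuristic reward, assuming per-step access to the tool-tip pose and binary visualisation flags (the component names and weights are illustrative, not the exact terms of table 12):

```python
import numpy as np

def heuristic_reward(tip_pos, target_pos, tip_dir, target_dir,
                     target_visualised, task_completed,
                     w_dist=1.0, w_angle=0.5, w_vis=1.0, w_done=10.0):
    # Dense positional potential: penalise the Euclidean distance to the target.
    dist_penalty = -w_dist * np.linalg.norm(np.asarray(tip_pos) - np.asarray(target_pos))

    # Dense rotational potential: penalise the angle between the current and
    # desired viewing directions.
    cos_angle = np.dot(tip_dir, target_dir) / (
        np.linalg.norm(tip_dir) * np.linalg.norm(target_dir))
    angle_penalty = -w_angle * np.arccos(np.clip(cos_angle, -1.0, 1.0))

    # Binary, task-dependent components: visualisation bonus and completion bonus.
    vis_bonus = w_vis * float(target_visualised)
    done_bonus = w_done * float(task_completed)

    return dist_penalty + angle_penalty + vis_bonus + done_bonus
```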

Table 11 :
SAC hyperparameters used for section 4 experiments

Table 12 :
Heuristic reward structure